AITopics

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Neural Information Processing Systems, Feb-16-2026, 06:18:04 GMT

835a0185f61867a1ea0f86155489839a-Paper-Conference.pdf

artificial intelligence, lemma, machine learning

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Neural Information Processing Systems, Feb-12-2026, 20:23:28 GMT

Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel

Colin Wei, Jason D. Lee, Qiang Liu, Tengyu Ma

Recent works have shown that on sufficiently over-parametrized neural nets, gradient descent with relatively large initialization optimizes a prediction function in the RKHS of the Neural Tangent Kernel (NTK).

artificial intelligence, arxiv preprint arxiv, machine learning

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > United States > Texas > Travis County > Austin (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.38)

Neural Information Processing Systems, Oct-11-2025, 00:28:38 GMT

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps.

lemma, stable phase, two-layer network

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.65)

arXiv.org Machine Learning, Jun-26-2024

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

Cai, Yuhang, Wu, Jingfeng, Mei, Song, Lindsey, Michael, Bartlett, Peter L.

The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.
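The two-phase behaviour described in the abstract can be probed numerically. The following is a minimal sketch, not the authors' setup: it assumes a plain linear predictor instead of a two-layer network, a hypothetical four-point separable dataset, and illustrative stepsizes (0.1 and 50.0). The helper names `run_gd` and `count_increases` are made up for this example. It records the empirical logistic risk at every step so that monotone and non-monotone trajectories can be told apart.

```python
import math


def logistic_loss(z):
    """Numerically stable log(1 + exp(-z))."""
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))


def neg_loss_derivative(z):
    """Numerically stable 1 / (1 + exp(z)), i.e. -d/dz log(1 + exp(-z))."""
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def run_gd(xs, ys, stepsize, steps):
    """Constant-stepsize GD on the average logistic loss of a linear predictor.

    Returns the empirical risk measured before each update.
    """
    n = len(xs)
    w = [0.0] * len(xs[0])
    risks = []
    for _ in range(steps):
        margins = [y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(xs, ys)]
        risks.append(sum(logistic_loss(m) for m in margins) / n)
        grad = [0.0] * len(w)
        for x, y, m in zip(xs, ys, margins):
            coeff = -y * neg_loss_derivative(m) / n
            for j, xj in enumerate(x):
                grad[j] += coeff * xj
        w = [wj - stepsize * gj for wj, gj in zip(w, grad)]
    return risks


def count_increases(risks):
    """Number of steps at which the risk went up."""
    return sum(1 for a, b in zip(risks, risks[1:]) if b > a)


# Hypothetical linearly separable toy data (labels in {-1, +1}).
xs = [(1.0, 2.0), (2.0, 1.0), (-1.0, -2.0), (-2.0, -1.5)]
ys = [1, 1, -1, -1]

small_step = run_gd(xs, ys, stepsize=0.1, steps=200)   # below 2 / smoothness
large_step = run_gd(xs, ys, stepsize=50.0, steps=200)  # illustrative large stepsize
```

With the small stepsize the risk decreases at every step, as the descent lemma guarantees for a smooth convex objective. `count_increases(large_step)` reports how many steps raised the risk for the large stepsize; how long any such phase lasts depends on the data and the stepsize.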

lemma, stable phase, two-layer network

2406.08654

Country: North America > United States > New York > New York County > New York City (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.85)

arXiv.org Machine Learning, Oct-14-2023

Optimal AdaBoost Converges

Snedeker, Conor

The following work is a preprint collection of formal proofs regarding the convergence properties of the AdaBoost machine learning algorithm's classifier and margins. Various math and computer science papers have been written regarding conjectures and special cases of these convergence properties. Furthermore, the margins of AdaBoost feature prominently in the research surrounding the algorithm. At the zenith of this paper we present how AdaBoost's classifier and margins converge on a value that agrees with decades of research. After this, we show how various quantities associated with the combined classifier converge.
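For readers unfamiliar with the quantities involved, here is a minimal sketch of AdaBoost over a finite pool of weak hypotheses, together with the normalized margins of the combined classifier. It assumes that "optimal" means picking the hypothesis with the lowest weighted error in each round; the abstract does not define the term, so treat that reading as an assumption. The function names and the three-example dataset are hypothetical.

```python
import math


def optimal_adaboost(predictions, labels, rounds):
    """Run AdaBoost, choosing the lowest weighted-error hypothesis each round.

    predictions[h][i] in {-1, +1} is the output of hypothesis h on example i.
    Returns the accumulated coefficient (alpha) of every hypothesis.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    alphas = [0.0] * len(predictions)
    for _ in range(rounds):
        errors = [
            sum(w for w, p, y in zip(weights, preds, labels) if p != y)
            for preds in predictions
        ]
        best = min(range(len(predictions)), key=errors.__getitem__)
        eps = errors[best]
        if eps <= 0.0 or eps >= 0.5:
            break  # perfect hypothesis, or nothing better than chance
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        alphas[best] += alpha
        # Up-weight the examples the chosen hypothesis got wrong, then renormalize.
        weights = [
            w * math.exp(-alpha * y * p)
            for w, p, y in zip(weights, predictions[best], labels)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return alphas


def normalized_margins(predictions, labels, alphas):
    """Margin y_i * sum_h alpha_h h(x_i) / sum_h alpha_h for every example."""
    total = sum(alphas)
    if total == 0.0:
        raise ValueError("no hypothesis was selected")
    return [
        y * sum(a * preds[i] for a, preds in zip(alphas, predictions)) / total
        for i, y in enumerate(labels)
    ]


# Hypothetical pool: hypothesis k is wrong on example k only.
labels = [1, -1, 1]
predictions = [
    [-1, -1, 1],
    [1, 1, 1],
    [1, -1, -1],
]

alphas = optimal_adaboost(predictions, labels, rounds=3)
margins = normalized_margins(predictions, labels, alphas)
```

After three rounds each hypothesis has been selected once and every example has a positive margin, so the weighted vote classifies the toy set correctly even though no single hypothesis does.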

adaboost, artificial intelligence, machine learning

2210.07808

Country:

North America > United States > Ohio (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.30)

Farhang, Alexander R., Bernstein, Jeremy, Tirumala, Kushal, Liu, Yang, Yue, Yisong

Investigating Generalization by Controlling Normalized Margin

arXiv.org Artificial Intelligence, Sep-20-2022

Weight norm $\|w\|$ and margin $\gamma$ participate in learning theory via the normalized margin $\gamma/\|w\|$. Since standard neural net optimizers do not control normalized margin, it is hard to test whether this quantity causally relates to generalization. This paper designs a series of experimental studies that explicitly control normalized margin and thereby tackle two central questions. First: does normalized margin always have a causal effect on generalization? The paper finds that no -- networks can be produced where normalized margin has seemingly no relationship with generalization, counter to the theory of Bartlett et al. (2017). Second: does normalized margin ever have a causal effect on generalization? The paper finds that yes -- in a standard training setup, test performance closely tracks normalized margin. The paper suggests a Gaussian process model as a promising explanation for this behavior.
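As a concrete illustration of the quantity $\gamma/\|w\|$, the sketch below computes the normalized margin of a linear classifier and checks that it is unchanged when the weights are rescaled. It assumes a linear model, labels in {-1, +1}, and a hypothetical three-point dataset; the paper itself studies neural networks, so this is only the simplest instance of the definition.

```python
import math


def normalized_margin(w, xs, ys):
    """Return min_i y_i * <w, x_i> / ||w|| for labels y_i in {-1, +1}."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    margin = min(
        y * sum(wj * xj for wj, xj in zip(w, x)) for x, y in zip(xs, ys)
    )
    return margin / norm


# Hypothetical toy data.
xs = [(2.0, 0.0), (0.0, 3.0), (-1.0, -1.0)]
ys = [1, 1, -1]
w = (1.0, 1.0)

gamma = normalized_margin(w, xs, ys)
# Rescaling w multiplies both the margin and the norm by the same factor.
gamma_scaled = normalized_margin(tuple(10.0 * wj for wj in w), xs, ys)
```

Because rescaling the weights leaves the value unchanged, the raw margin alone says little: it can be inflated simply by growing $\|w\|$, which is why the normalized form is the one that appears in the theory.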

artificial intelligence, generalization, machine learning

2205.0394

Country:

North America > United States > Maryland > Baltimore (0.04)
North America > Canada > Ontario > Toronto (0.04)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Chuang, Ching-Yao, Mroueh, Youssef, Greenewald, Kristjan, Torralba, Antonio, Jegelka, Stefanie

Measuring Generalization with Optimal Transport

arXiv.org Machine Learning, Jun-6-2021

Understanding the generalization of deep neural networks is one of the most important tasks in deep learning. Although much progress has been made, theoretical error bounds still often behave disparately from empirical observations. In this work, we develop margin-based generalization bounds, where the margins are normalized with optimal transport costs between independent random subsets sampled from the training distribution. In particular, the optimal transport cost can be interpreted as a generalization of variance which captures the structural properties of the learned feature space. Our bounds robustly predict the generalization error, given training data and network parameters, on large scale datasets. Theoretically, we demonstrate that the concentration and separation of features play crucial roles in generalization, supporting empirical results in the literature. The code is available at https://github.com/chingyaoc/kV-Margin.

generalization, generalization error, lipschitz constant

2106.03314

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Jordan (0.04)
Asia > Middle East > Israel (0.04)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

Hopkins, Max, Kane, Daniel M., Lovett, Shachar, Mahajan, Gaurav

Point Location and Active Learning: Learning Halfspaces Almost Optimally

arXiv.org Machine Learning, Apr-23-2020

Given a finite set $X \subset \mathbb{R}^d$ and a binary linear classifier $c: \mathbb{R}^d \to \{0,1\}$, how many queries of the form $c(x)$ are required to learn the label of every point in $X$? Known as \textit{point location}, this problem has inspired over 35 years of research in the pursuit of an optimal algorithm. Building on the prior work of Kane, Lovett, and Moran (ICALP 2018), we provide the first nearly optimal solution, a randomized linear decision tree of depth $\tilde{O}(d\log(|X|))$, improving on the previous best of $\tilde{O}(d^2\log(|X|))$ from Ezra and Sharir (Discrete and Computational Geometry, 2019). As a corollary, we also provide the first nearly optimal algorithm for actively learning halfspaces in the membership query model. En route to these results, we prove a novel characterization of Barthe's Theorem (Inventiones Mathematicae, 1998) of independent interest. In particular, we show that $X$ may be transformed into approximate isotropic position if and only if there exists no $k$-dimensional subspace with more than a $k/d$-fraction of $X$, and provide a similar characterization for exact isotropic position.

learner, probability, query

2004.1138

Country: North America > United States > California (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Computational Learning Theory (0.93)

arXiv.org Machine Learning, Jun-13-2019

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Lyu, Kaifeng, Li, Jian

Recent works on implicit regularization have shown that gradient descent converges to the max-margin direction for logistic regression with one-layer or multi-layer linear networks. In this paper, we generalize this result to homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations. In particular, we study the gradient flow (gradient descent with infinitesimal step size) optimizing the logistic loss or cross-entropy loss of any homogeneous model (possibly non-smooth), and show that if the training loss decreases below a certain threshold, then we can define a smoothed version of the normalized margin which increases over time. We also formulate a natural constrained optimization problem related to margin maximization, and prove that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the optimization problem. Furthermore, we extend the above results to a large family of loss functions. We conduct several experiments to justify our theoretical finding on MNIST and CIFAR-10 datasets. For gradient descent with constant learning rate, we observe that the normalized margin indeed keeps increasing after the dataset is fitted, but the speed is very slow. However, if we schedule the learning rate more carefully, we can observe a more rapid growth of the normalized margin. Finally, as margin is closely related to robustness, we discuss potential benefits of training longer for improving the robustness of the model.
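The homogeneity property that the abstract relies on can be checked directly. The sketch below assumes a bias-free two-layer ReLU network, which is positively homogeneous of order 2 in its parameters, and takes the normalized margin to be the minimum of $y_i f(\theta; x_i)$ divided by $\|\theta\|^2$. The parameter values, dataset, and function names are hypothetical.

```python
import math


def relu(v):
    return v if v > 0 else 0.0


def two_layer_relu(theta, x):
    """f(theta; x) = sum_k a_k * relu(<w_k, x>), with no bias terms."""
    hidden, outer = theta
    return sum(
        a * relu(sum(wj * xj for wj, xj in zip(w, x)))
        for w, a in zip(hidden, outer)
    )


def param_norm(theta):
    """Euclidean norm of all parameters taken together."""
    hidden, outer = theta
    squares = sum(wj * wj for w in hidden for wj in w) + sum(a * a for a in outer)
    return math.sqrt(squares)


def normalized_margin(theta, xs, ys, order=2):
    """min_i y_i * f(theta; x_i) / ||theta||^order."""
    q_min = min(y * two_layer_relu(theta, x) for x, y in zip(xs, ys))
    return q_min / param_norm(theta) ** order


def scale(theta, c):
    """Multiply every parameter by c."""
    hidden, outer = theta
    return ([[c * wj for wj in w] for w in hidden], [c * a for a in outer])


# Hypothetical parameters and data (labels in {-1, +1}).
theta = ([[1.0, -0.5], [-1.0, 2.0]], [1.0, -1.0])
xs = [(1.0, 0.0), (0.0, 1.0)]
ys = [1, -1]

gamma = normalized_margin(theta, xs, ys)
gamma_scaled = normalized_margin(scale(theta, 3.0), xs, ys)
```

Scaling all parameters by 3 multiplies the network output by 9 and the squared norm by 9, so the normalized margin is unchanged. This scale invariance is what makes it a meaningful quantity to track while the parameter norm grows during training.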

artificial intelligence, machine learning, normalized margin